Accelerating Dust Temperature Calculations with Graphics Processing Units 



Patrik Jonsson, Joel R. Primack 

Santa Cruz Institute for Particle Physics, University of California, Santa Cruz, CA 95064, USA 



Abstract 



ON 

o 
o 

(N 
O 

Q 

m 
(N 



When calculating the infrared spectral energy distributions (SEDs) of galaxies in radiation-transfer models, the calcu- 
lation of dust grain temperatures is generally the most time-consuming part of the calculation. Because of its highly 
parallel nature, this calculation is perfectly suited for massively parallel general-purpose Graphics Processing Units 
(GPUs). This paper presents an implementation of the calculation of dust grain equilibrium temperatures on GPUs in 
the Monte-Carlo radiation transfer code SUNRISE, using the CUDA API. The GPU can perform this calculation 69 times 
faster than the 8 CPU cores, showing great potential for accelerating calculations of galaxy SEDs. 

Key words: dust, radiative transfer, methods: numerical 
PACS: 95.30.Jx, 98.38.Ca, 98.58.Jg, 95.75.Pq 
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1. Introduction 

Over the last decade, radiation-transfer models of dust 
in galaxies have become increasingly capable, with higher 
spatial resolutions, higher wavelength resolutions, and more 
realistic descriptions of the dust itself (Gordon et al., 2001; 
Jonsson, 2006; Li ct al., 2008; Bianchi, 2008; Chakrabarti 
and Whitney, 2009; Jonsson ct al., 2009). Current codes 
calculate the temperatures of grains of a number of differ- 
ent compositions and sizes based on physical dust models 
(e.g. Weingartner and Draine, 2001; Draine and Li, 2007) 
at many different spatial locations in the problem, some- 
times also including thermal fluctuations. As the com- 
plexity of the dust models have gone up, researchers have 
found that the most computationally demanding part of 
the calculation no longer is the actual transfer of radiation 
through the medium, but rather the calculation of the dust 
emission spectrum. 

At the same time, graphics-processing units (GPUs) 
have evolved from specialty hardware to massively paral- 
lel general computation devices. These devices are made 
to render independent pixels, where the same computa- 
tion is being done millions of times on largely independent 
data. The predictable data access pattern means that the 
considerable amount of silicon real estate which is used 
to hide memory latency on a general-purpose CPU is un- 
necessary. Instead, the entire chip area can be used for 
computational units, giving a raw floating-point process- 
ing power many times larger than that of CPUs of similar 
cost and power consumption. 
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The utility of GPUs for general high-performance com- 
putations was initially limited by the fact that programs 
needed to be expressed in terms of graphics operations 
and written in shading languages such as Cg or GLSL, 
making it difficult to port programs. Still, astrophysical 
applications were explored, for example N-body calcula- 
tions (Portegies Zwart et al., 2007) with large increases 
in performance compared to custom GRAPE hardware. 
The programming obstacle was largely overcome with the 
Nvidia CUDA (Compute Unified Device Architecture, NVI- 
DIA Corporation, 2009b) API, which enables programs to 
be written in a language very similar to C/C++, with 
added directives. CUDA has been used in astrophysics 
not only for N-body calculations (Bcllcman ct al., 2008; 
Gaburov et al., 2009) but also for radio interferometer data 
processing (Ord et al., 2009) and exoplanet searches (Ford, 
2009), to name a few. 

This paper presents an implementation of the dust 
grain equilibrium temperature and emission SED calcu- 
lation in CUDA for the Monte-Carlo radiation transfer 
code SUNRISE (Jonsson, 2006; Jonsson et al., 2009). The 
CUDA version obtains a speedup of more than two orders 
of magnitude compared to a modern CPU core when using 
an Nvidia Tesla C1060 high-performance computing card. 
We start by outlining the calculation and the CUDA im- 
plementation in Sections 2-4, present accuracy tests and 
benchmarks in Section 5, and conclude with a discussion 
and summary in Sections 6 & 7. 

2. The Calculation of Dust Emission 

sunrise, a radiation-transfer code using the Monte- 
Carlo method to calculate the transfer of light through 
a dusty medium, has been described in previous papers. 
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The basic computational algorithm was described in Jon- 
sson (2006), and the computation of dust emission, along 
with results from an application of the code to hydro- 
dynamic simulations of spiral galaxies, in Jonsson et al. 
(2009). sunrise is free software and the source code, in- 
cluding the CUDA implementation used here, is available 
on the SUNRISE home page. 1 The calculation of the dust 
emission spectra, detailed in e.g. Misselt et al. (2001), is 
reviewed here to set the stage for implementing this cal- 
culation in CUDA. 

In the simplest calculation of emission from dust grains, 
each grain is simply assumed to emit like a modified black- 
body, the temperature of which can be found by equat- 
ing the emitted luminosity with the luminosity Lh heat- 
ing the grain (normally assumed to result from absorp- 
tion of shorter-wavelength radiation, but in principle from 
any source, for example heating by high-energy electrons). 
This leads to the equation 



L h =J a a {\)B{\, T e ) dA = 2hc 2 J 



Cq(A) 
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(1) 

which needs to be solved numerically to find the equi- 
librium temperature T e . In this formula, er a (A) is the 
(wavelength-dependent) absorption cross section of the dust 
grain, B{X,T) is the blackbody function, and c, h, and k 
are the speed of light and the constants of Planck and 
Boltzmann. Once the equilibrium temperature is known, 
the dust emission spectrum L\ e can be calculated as 



L\,eW — 0"a(A)-B(A, T e ) 



(2) 



Because a solution to Equation 1 is necessary for a 
number of dust grains of different sizes and compositions 
in a number of different grid cells with different intensities 
of the heating radiation, Equation 1 needs to be solved 
on the order of hundreds of millions of times to calculate 
the dust emission in one of the simulation snapshots to 
which SUNRISE is typically applied. To show this more ex- 
plicitly, the subscripts s and c are used to indicate that 
the quantities depend on the dust grain species (size and 
composition) and grid cell, respectively. The calculations 
necessary can then be summarized by the following equa- 
tions: 

L h - C , s = J I c (X)a a . s (\)dX (3) 

L\,e;c,sW = cr a; S (A)B(A,T e;C:S ) (5) 

The need for a numerical solution to Equation 4, using 
on the order of 1000 wavelength bins, and the fact that 
a global iterative procedure is necessary to find the equi- 
librium distribution of radiative intensities in cases where 



1 The SUNRISE home page is http : //sunrise . f amiljenjonsson. 
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the dust is optically thick to its own radiation (see Jons- 
son et al., 2009), means that calculating the dust emission 
involves the evaluation of at least 10 11 — 10 12 exponentials. 
Evaluating an exponential is one of the most computation- 
ally costly mathematical operations to perform on a mod- 
ern CPU, and this explains the large computational cost of 
the calculation. Indeed, in a typical sunrise run, the time 
for the dust emission calculation outweighs the time for the 
transfer of radiation by a large factor. (For the case used 
to produce the results in Section 5, the temperature cal- 
culation takes ~ 90% of the total time, and this generally 
increases further for cases with more complicated geome- 
try and higher optical depth.) Since operations with high 
floating-point intensity are good candidates for porting to 
a GPU due to their large advantage in raw floating-point 
power, this indicates that performing the calculation on 
the GPU has a large potential speedup. 

The seasoned computationist will at this point argue 
that this reasoning may be correct but ultimately irrele- 
vant, as the sensible way of calculating T e many times is 
by setting up a lookup table of T e - s (Lh- s ), thus bypassing 
the numerical solution of Equation 4. This is true (and 
is how the calculation is actually performed in sunrise), 
but proceeding with the GPU implementation is useful, 
for several reasons. First, as will be shown below, the ex- 
act calculation of L e (X) performed on the GPU is actually 
faster than a temperature table interpolation implemented 
on the CPU, a remarkable fact in itself. Second, the im- 
plementation of the calculation of equilibrium tempera- 
ture emission serves as a useful warm-up for implementing 
the calculation of grains with fluctuating temperatures (as 
outlined in e.g. Guhathakurta and Draine, 1989), which is 
even more computationally intensive. 

3. The CUDA Programming Model 

The CUDA programming model consists of functions, 
called kernels, that arc meant to execute on the GPU. 
When the host CPU starts a kernel, it is executed simul- 
taneously for a large number of threads. These threads 
are divided into blocks, where threads in the same block 
can communicate through a comparatively small amount 
of shared, high-speed memory. On the current generation 
of hardware (CUDA compute model 1.3), a block can con- 
sist of up to 1024 threads. Threads in different blocks 
cannot communicate and are completely independent. 

The actual hardware consists of an array of execu- 
tion units called Streaming Multiprocessors. Each mul- 
tiprocessor executes one (or several) thread blocks concur- 
rently. The set of threads that executes simultaneously, 
called a warp, on current hardware is 32. Within a warp, 
the threads execute one common instruction at a time, 
but with individual data. Out of the currently executing 
threads, the hardware schedules warps with zero overhead, 
so warps that are unable to execute because they are wait- 
ing for loads from the (uncached) global memory will yield 
the hardware to other warps that are ready to execute. 
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The performance concerns for writing GPU programs 
are enumerated in the CUDA "Best Practices Guide" (NVI- 
DIA Corporation, 2009a). Two of the most important 
concerns of an efficient GPU algorithm arc: first, mini- 
mizing access to the (uncached) global memory by stag- 
ing data in the fast shared memory to avoid unnecessary 
loads and making sure loads are coalesced. For a coalesced 
load, all threads in the two halves of a warp perform their 
loads in one memory transaction (unlike uncoalcsccd loads 
which require several - up to 16 - memory transactions 
and can thus take up to 16 times longer). The criteria 
for coalesced loads vary between device generations, and 
the CI 060 hardware used here can coalesce all loads where 
threads access memory within the same 128-byte segment 
(for loads of 4-byte data). Because of the large potential 
performance impact of uncoalesced loads, care has to be 
taken when designing the memory access pattern of the 
kernels. 

The second important concern is the hiding of the global 
memory latency by maintaining high concurrency (NVI- 
DIA Corporation, 2009b). If the number of concurrent 
warps executing is high enough, warps waiting for global 
memory can always be substituted for others that are 
ready to execute. The number of blocks that can exe- 
cute concurrently on a multiprocessor is determined by the 
number of processor registers and the amount of shared 
memory used by the kernel, so minimizing use of these 
resources is important for maintaining high concurrency. 

4. The CUDA Temperature Calculation 

With the above background, it is now clear that the 
dust temperature calculation is well suited for a GPU im- 
plementation. The calculation consists of millions of iden- 
tical, independent evaluations of Equations 3-5. Each 
CUDA thread will calculate the temperature and emission 
SED of a particular dust grain species in a particular grid 
cell. 

If the integrals in Equations 3-5 are replaced with 
the finite-difference equivalent sums that will actually be 
calculated, the result is 

Lh;c,s = y]lc,l^a;s,l AA; (6) 

I 

Lh ^ s = 2hc2 Y, (^hc/wlt)- 1) A? (7) 

L\,e;c,s,l — 0- a - s .iB(Xi,T e . CjS ) (8) 

where I is the index used for the wavelength-dependent 
quantities. 

In addition, after calculating the emission spectra of 
the individual grains in a grid cell, the total emission in 
a grid cell needs to be calculated by summing over the 
different grain species in the cell. Combining this sum 



with Equation 8 yields 

L\,e;c,l = ^ Lx,e;c,s,in s md;c (9) 
s 

E 2hc 2 a a . S! in s m d -c 
( e W(kA,T eiC , s ) - 1) Af ' ( ' 

where n s is the number of dust grains of species s per unit 
mass of dust, and is the mass of dust in the grid cell. 

The calculation is performed with 4 kernels: The first 
kernel precomputes the quantity <r a;Si / AA;, which is used 
in Equations 6 & 7. The second kernel calculates the heat- 
ing luminosity L h . c _ s . The third solves Equation 7 for the 
equilibrium temperature T e . cs . The fourth kernel calcu- 
lates L\, e:Ci i using Equation 10. 

The thread blocks have specific sizes, such as 16 ele- 
ments of s and 8 elements of I, and the actual number of 
cells, dust species and wavelengths may not be evenly di- 
visable by the block size. For this reason, the problem is 
padded with zeros so no edge checks are necessary in the 
kernels. 

All GPU calculations use single-precision floating point 
numbers. While the GT200 architecture does have double- 
precision support, the double-precision performance is a 
small fraction of that using single-precision. As shown in 
Section 5, the use of single precision does not significantly 
affect the computed results. However, care must be taken 
to avoid under- or overflowing the limited dynamic range, 
as astrophysical numbers can span ranges rivaling that of 
single- precision numbers. 

The GPU also has a fast intrinsic function for calculat- 
ing exponentials in one clock cycle, __expf . This function 
has lower accuracy than the math library function expf , 
but is much faster. It was verified that using __expf had 
no adverse impact on the accuracy of the SED calculation, 
but resulted in about 40% higher overall performance, so 
the results presented here use __expf for all exponential 
calculations. 

4-1. Kernel 1 

The first kernel is trivial. Each thread is assigned an 
index of (s, I), loads its elements of a 8 j and AA;, and saves 
the product. This calculation is not affected by the number 
of cells and always takes negligible time. 

4.2. Kernel 2 

The second kernel calculates the heating luminosity 
Lh; C ,s- Looking at Equation 6, it can be seen that this 
operation actually is a matrix multiplication of (crAA) s j 
and I c j. Each thread is assigned an index of (c,s) and 
calculates one element of Lh- C ,s- To minimize loads from 
global memory, the threads in the block first stages blocks 
of (ctAA) s .; and I c j for an interval in I in shared memory. 
Each thread then performs the partial sum in Equation 6 
for the interval in I that has been staged. The threads then 
continue to stage another interval in I until the full range in 
I has been processed. The matrix multiplication example 
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is extensively discussed in the CUDA Programming Guide 
(NVIDIA Corporation, 2009b). 

4.3. Kernel 3 

The third kernel, which solves for the equilibrium tem- 
perature T e:CtS is the most complex and time-consuming 
one. Each thread is assigned an index (c, s) and, since 
grains with the same s in different grid cells can share 
the same cross-section data, it is advantageous for each 
block to process one grain species over many grid cells. 
This is only possible as long as the entire cross section 
vector for one grain species can fit in the 16kb of shared 
memory, which currently limits the maximum number of 
wavelengths to ~ 4000. (This limit is of no practical con- 
sequence at this time.) 

Each thread then loads its value of the heating Lh- C .s 
and calculates a Newton-Raphson iterative solution of Equa- 
tion 7. The iteration is stopped when the energy non- 
conservation in Equation 7 is below a specified fraction, 
currently 0.1%. 

The number of iterations required is a strong function 
of the accuracy of the initial guess for the grain tempera- 
ture, and because the temperature calculation dominates 
the total computational time (both on the GPU and CPU) 
this determines the total time of the calculation. To de- 
termine a good guess, the relation between Lh and T e was 
studied, for both graphite, silicate and PAH grains. It was 
found that, for these types of grains, a function of the form 

logT w k + fox + (3 2 x 2 , where (11) 
x = log L h + ai log a + a 2 log 2 a , (12) 

where a is the grain size, can be used to provide the ini- 
tial guess. Fitting to the grain cross sections resulted in 
parameters k = 1.86, ft = 0.189, j3 2 = 3.41 x 10~ 3 , 
ai = -1.39, and a 2 = 0.126 (in SI units). With Equa- 
tion 11 as the initial guess, the number of iterations re- 
quired to obtain the temperatures of graphite and silicate 
grains were at most 7, for any grain temperature below 
1000K. The PAH grains sometimes require more iterations, 
but no more than 9. The exception was that if Equation 11 
suggested a temperature of less than 5K, 5K was used. 
This is due to the convergence behavior of the iteration, 
which can be unstable at low temperatures if the initial 
guess is lower than the equilibrium temperature. 

Since the number of iterations required to obtain a so- 
lution of required accuracy may vary between the threads 
in the block, the threads may diverge. In these cases, exe- 
cution of the divergent threads is automatically serialized 
by the hardware. In this case, the threads that require 
fewer iterations will automatically pause and wait for the 
thread that takes the longest. It is thus advantageous to 
make sure the initial guess is good enough that the num- 
ber of iterations required for the different threads in a warp 
are all roughly similar, but there is no need to handle this 
divergence explicitly in the code. 



4-4- Kernel 4 

The final part of the calculation is the generation of the 
total dust emission SED in the grid cells from the equilib- 
rium temperatures using Equation 10. Because this kernel 
uses many different quantities, it has the most complicated 
shared-memory staging. Each thread is indexed by (c, V) 
and calculates L\ e . c i. The sums over s are then done by 
staging blocks of cr Sj ;, T e:c , s , and n s for the interval in s 
covered by the block in shared memory. The layout of this 
calculation is thus largely similar to that of kernel 1 . Af- 
ter the sum is completed, the final quantity is obtained by 
multiplying by m^ c . (Since each value of m^ c is only used 
once, there is no point in staging it in shared memory.) 

After the kernels have finished executing, the result- 
ing SED array is copied back to system memory and, for 
the purpose of the test presented here, compared with the 
SEDs calculated on the CPUs. 

5. Results 

The results presented here compare the CUDA calcula- 
tion, performed on an Nvidia Tesla C1060 high-performace 
computing card with 4GB of memory using CUDA ver- 
sion 2.3, to that performed on an 8-core Intel Xeon E5420 
(2.5 GHz) Linux machine with 32 GB RAM. The code used 
for the measurements in this paper is an improved version 
of sunrise version 3.01 2 , and the CUDA calculation is 
fully integrated into SUNRISE version 3.02. To make the 
comparison more fair, the CPU code was tuned to im- 
prove cache performance and load balancing, and to use 
the improved initial guess in Equation 11, though there 
are surely still improvements that could be made. The 
radiation intensities and dust masses were taken from the 
simulation of the Sbc galaxy presented in Jonsson ct al. 
(2009), with a grid containing 590k cells. The dust grains 
were modeled using the graphite cross sections of Laor 
and Draine (1993) with 81 size bins and the size distribu- 
tion of the Milky- Way R = 3.1 model from Draine and Li 
(2007). The wavelength grid contained 968 wavelengths 
distributed from the Lyman limit to 1000 /im. 

5.1. Accuracy 

To verify that the results of the CUDA calculation are 
correct, they were compared against the extensively tested 
calculation in the standard version of sunrise performed 
on the CPU. The result is shown in Figure 1, which shows 
the mean and variance, over all grid cells, of the relative 
SED calculated on the GPU and CPU. Because the SED 
changes dramatically from one grid cell to another, the 
comparison was done by weighting the cells by the SED 
(as calculated on the CPU) in that cell. If this is not done, 
the large variation in the exponential cutoff at short wave- 
lengths will cause the variance to be dominated by the cells 



2 The code can be retrieved from the SUNRISE SVN repository un- 
der svn/branches/cuda-paper, revision 2400. 
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Figure 1: The dust emission SED calculated by the GPU compared 
to that calculated on the CPU. The line indicates the mean over 
all cells of the SED ratio, weighted by the CPU SED. The shaded 
region indicates the 1<t variance (also SED-wcightcd) over the cells. 
The blow-up at short wavelengths occurs because the GPU SED is 
calculated in single-precision and, for the dust temperatures encoun- 
tered in the problem, the exponential blackbody cutoff will underflow 
to zero at short wavelengths. This increase in variance at short wave- 
lengths has no practical significance as it occurs precisely because the 
actual emission drops to negligible levels. 

with cold dust - precisely the cells that do not contribute 
significantly to the total emission. For wavelengths longer 
than 20 fim, where grains in thermal cquilibrum can be 
expected to have an important impact, the difference be- 
tween the two calculations is about 0.1%, shrinking rapidly 
at longer wavelengths. The difference increases at shorter 
wavelengths, reaching 1% at 4/xm and then blowing up as 
the single-precision exponential cutoff underflows. The la- 
spread around the mean in the results is always negligibly 
small. 

5.2. Performance 

To collect performance measurements, the timers in 
the CUDA runtime library were used to measure the exe- 
cution times of the different kernels, the total CUDA time 
(including all kernel runs and data transfer) , and the CPU 
calculation. These data are shown in Table 1 and Figure 2. 
For large problem sizes, the GPU calculation is 69 times 
faster than the CPU calculation (on 8 cores). The GPU 
execution time is dominated by Kernel 3, the numerical 
temperature solution, which takes 57% of the total time. 
The time used by the 4 kernels adds up to only 79% of the 
total GPU time, the remaining time being data transfer 
to and from the device. 

Using the CUDA Visual Profiler (v2.3), the occupancy 
and memory bandwidth of the kernels were determined. 
Kernels 1 & 2 have full occupancy, while kernels 3 and 4 
have occupancies of 0.5. The time-dominating kernel 3 is 
not very sensitive to the occupancy, as it is not memory- 
bandwidth limited. The global memory bandwidth used 
by kernel 3 is only about 95 MB /s, far below the maximum 
of 102GB/s. (This is not surprising, given the fact that 
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Figure 2: The wall clock time required to calculate the dust emission 
SED of one grid cell, as a function of the total number of cells, for 
the CPU (using 8 cores) and GPU. The time when interpolation was 
used on the CPU is indicated by a point, for the largest number 
of cells only. This problem used the graphite cross sections of Laor 
and Drainc (1993), with 81 size bins and 968 wavelengths. For small 
problem sizes, the GPU time is dominated by kernel launch and data 
transfer overhead and is essentially independent of number of cells 
up to > 10 cells. For the CPU calculation, a block size of 32 cells 
was used, so for 256 cells or less, there are not enough blocks to load 
all 8 cores, so the time is then independent of problem size. This is 
merely an artifact of the block size used. 



the only per-cell quantities used by kernel 3 are L h and 
T e .) 

The instruction throughput of kernel 3 is about 0.7 
instructions per cycle, indicating that the threads are not 
waiting for memory access for a significant amount of time. 
The throughput would presumably be even higher in the 
absence of divergent warps. According to the profiler, 
about 4 x 10~ 5 of all branches are divergent for this kernel. 

6. Discussion 

It is clear from the performance numbers presented 
here that the GPU temperature calculation dramatically 
outperforms the calculation done on an 8-core fairly mod- 
ern CPU. This is not very surprising, given that the the- 
oretical maximum floating point performance of the Tcsla 
unit used here is about 6 times greater than that of the 8 
Xeon cores. What is more surprising is that the difference 
in performance actually is an order of magnitude greater 
than this. 

While the large performance difference shows that the 
calculation performed here is perfectly suited for the mas- 
sively parallel GPU, one part of the explanation is surely 
that the CUDA code is a low-level calculation purposely 
written directly to conform to the rules for getting maxi- 
mum performance out of the GPU, while the CPU C++ 
code is written at a much higher abstraction level that 
largely trusts that the compiler can generate efficient code 
from the Blitz++ expression template matrix library (Veld- 
huizen, 1998) and that the CPU cache machinery can hide 
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Table 1: Execution wall clock times, in seconds, of the GPU and CPU (on 8 processors) calculations for different problem sizes. The times 
are averages of 3 runs. (The kernel execution times were generally very stable, while the times that include data transfer and CPU showed 
larger variation, presumably due to OS background tasks. The large start-up cost of the GPU calculation is evident. These numbers were 
obtained for calculations of graphite grains, but the results were approximately the same if silicate and PAH cross sections were used. 



memory latency. The performance comparisons presented 
here thus say less about the innate performance differ- 
ence between the two architectures and more about the 
speedups that are possible when a small part of an ex- 
isting code is rewritten for maximum performance. It is 
likely that the performance of the CPU code could be fur- 
ther improved by rewriting the CPU calculation to the 
same low level as the CUDA code, meticulously paying 
attention to cache performance and blocking the loops to 
make sure cache thrashing is avoided. The virtue of the 
GPU memory access scheme, however, is that the rules are 
simple and straightforward. Knowing exactly how caches, 
pref etchers, etc., work on any given CPU is much more 
difficult, but even if cache performance on the CPU was 
optimal, there is no way the CPU calculation could out- 
perform the GPU. This is especially true given the heavy 
use of exponentials in the calculation, since the GPU can 
calculate an exponential in a single clock cycle. 

As mentioned earlier, the sunrise temperature calcu- 
lation in production runs is actually done using a linear 
interpolation table, containing 2000 temperature points 
between 3K and 1500 K. The time required is shown in 
the final column of Table 1. Remarkably, calculating the 
SED on the GPU is still more than an order of magnitude 
faster than when interpolating the temperature on the 8 
CPU cores. This is not quite as unexpected as it might 
seen, however. Even if the temperature is obtained in zero 
time, it is necessary to first calculate the heating rate. 
Subsequently, the emission SED must be calculated using 
Equation 10. It can be seen from Table 1 that, on the 
GPU, the temperature calculation (kernel 3) takes about 
70 percent of the total execution time. For the CPU code, 
this fraction is 80 precent, so completely eliminating the 
temperature calculation can only speed up the CPU cal- 
culation by maximally a factor of 5. The actual reduction 



in time by using the interpolation on the CPU is indeed 
very close to 80 percent, which leaves the GPU still more 
than an order of magnitude faster. 

One interesting point is that once a problem has been 
formulated to fit into the CUDA programming model, scal- 
ing to future generations of hardware is virtually ensured. 
As the calculation already has been subdivided into in- 
dependent blocks, these blocks can simply be distributed 
across a larger number of multiprocessors with essentially 
perfect scaling. 

How does the experience here apply to the (much more 
computationally intensive) calculation of thermally fluctu- 
ating grains? In contrast to the calculation here, calculat- 
ing the temperature probability distribution of thermally 
fluctuating grains essentially requires inverting a matrix 
for each grain species (Guhathakurta and Draine, 1989), 
the size of which is determined by the number of temper- 
ature levels in the distribution. If each thread still cal- 
culates the temperature distribution of a specific cell and 
grain species, shared memory use will increase from storing 
cr a:Sj ; to also storing the full transition matrix between the 
temperature levels for that grain species. Storing the ele- 
ments of the temperature probability distribution during 
the calculation will also require more thread- local storage, 
potentially increasing the use of bandwidth to global mem- 
ory. More work needs to be done to determine the best 
way to implement this calculation on the GPU, and this 
will be presented in a future paper. 

7. Summary 

We have presented an implementation of the calcula- 
tion of dust grain equilibrium temperatures in CUDA, and 
compared the performance of this implementation running 
on a GPU with that performed on a normal multicorc 
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CPU. The GPU vastly outperforms the 8 CPU cores, with 
a factor of 69 speedup. This is almost two orders of magni- 
tude faster despite the fact that the difference in theoreti- 
cal maximum floating-point performance is only a factor of 
6, showing that the inherently parallel, exponential-heavy 
nature of the calculation is perfectly suited to the GPU. 
As grain temperature calculations, not the actual transfer 
of radiation, are normally the most computationally ex- 
pensive part of calculating the SED of a galaxy, this holds 
great promise for accelerating such calculations. 
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